refactor: separate dataframe and dataset #865

Merged: 3 commits, Sep 8, 2020

Conversation

maartenbreddels
Member

We're back to Dataset again 😄: an immutable mapping of name to column.

The idea is to abstract away the data part of a DataFrame, which gives us a few advantages.

  • We can load data lazily and read multiple columns in one go.
  • A Dataset, and thus a DataFrame, can be serialized cheaply (e.g. store only the hdf5 or arrow path instead of serializing the data), so we can efficiently pickle/serialize/transport them when using Dask/Ray/Mars.
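To make the two points above concrete, here is a minimal sketch in plain Python. It is not the code in this PR: `JsonDataset` and its JSON-file backing are hypothetical stand-ins (a real dataset would point at hdf5 or arrow). It shows an immutable name-to-column mapping that reads nothing until a column is requested, and that pickles only its path.

```python
import json
import os
import pickle
import tempfile
from collections.abc import Mapping


class JsonDataset(Mapping):
    """Hypothetical immutable mapping of column name -> column (a tuple)."""

    def __init__(self, path):
        self.path = path
        self._columns = None  # nothing is read until a column is requested

    def _load(self):
        # Read the file once, on first access, fetching all columns in one go.
        if self._columns is None:
            with open(self.path) as f:
                raw = json.load(f)
            self._columns = {name: tuple(values) for name, values in raw.items()}
        return self._columns

    def __getitem__(self, name):
        return self._load()[name]

    def __iter__(self):
        return iter(self._load())

    def __len__(self):
        return len(self._load())

    def __getstate__(self):
        # Serialize only the file location, never the column data.
        return {"path": self.path}

    def __setstate__(self, state):
        self.path = state["path"]
        self._columns = None


# Write a small example file (illustrative data only).
path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w") as f:
    json.dump({"x": [1, 2, 3], "y": [4, 5, 6]}, f)

ds = JsonDataset(path)
payload = pickle.dumps(ds)       # carries the path, not the columns
restored = pickle.loads(payload)  # reloads lazily on the receiving side
```

Because `Mapping` defines no `__setitem__`, assigning `ds["x"] = ...` raises a `TypeError`, which is what keeps the mapping immutable in this sketch.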

This also introduces the idea of identifying data by a hash key so that data can be compared quickly, which makes caching easier (e.g. when using Dask) and helps when we want to cache complex operations (groupby).
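A rough sketch of the hash-key idea, under stated assumptions: `fingerprint` and `cached_group_sum` are hypothetical names, not functions from this PR, and the key is derived from a description of where the data lives rather than from the data itself.

```python
import hashlib
import json


def fingerprint(description):
    """Hash a JSON-serializable description of where the data comes from."""
    blob = json.dumps(description, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


_cache = {}
calls = {"count": 0}  # counts how often the aggregation really runs


def cached_group_sum(data_key, keys, values):
    """Sum values per key, reusing the result when the same data key is seen."""
    cache_key = (data_key, "group_sum")
    if cache_key not in _cache:
        calls["count"] += 1
        totals = {}
        for k, v in zip(keys, values):
            totals[k] = totals.get(k, 0) + v
        _cache[cache_key] = totals
    return _cache[cache_key]


# Two descriptions of the same data give the same key, whatever the dict order.
key_a = fingerprint({"path": "data.hdf5", "columns": ["g", "v"]})
key_b = fingerprint({"columns": ["g", "v"], "path": "data.hdf5"})

result = cached_group_sum(key_a, ["a", "b", "a"], [1, 2, 3])
again = cached_group_sum(key_b, ["a", "b", "a"], [1, 2, 3])  # served from cache
```

Comparing two keys is a cheap string comparison, so equality of large datasets can be checked without touching the data, and the second call skips the aggregation entirely.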

We will no longer have subclasses of DataFrame (except for DataFrameLocal and DataFrameRemote); all data-specific parts are handled in the Dataset.

@maartenbreddels
Member Author

I can get it to work with both Ray and Dask, but Dask crashes when running distributed, and I'm not sure why. I think we want to do the Ray/Dask stuff in a different PR, but for now I'll keep this together. 2 commits in 1 PR would also be fine, I think.

@maartenbreddels maartenbreddels force-pushed the refactor_dataset branch 7 times, most recently from 421bbcc to 2b2f178 Compare September 5, 2020 15:05
@maartenbreddels maartenbreddels marked this pull request as ready for review September 8, 2020 06:29
@maartenbreddels maartenbreddels merged commit a21784b into master Sep 8, 2020